Recall from our presentation that one central question to text mining and natural language processing is:
How do we quantify what a document or collection of documents is about?
For our first lab on text mining in STEM education, we’ll explore this question by examining a corpus, or collection, of public posts on Twitter about the Common Core State Standards (CCSS) to better understand public discourse surrounding these standards, particularly as they relate to math education. Specifically, in this lab we’ll be applying some basic text mining techniques to address the following questions:
What are the most frequent words or phrases used in tweets about the CCSS?
What words and hashtags commonly co-occur, particularly with the word “math”?
To help us better understand the packages, Twitter API tools, and data we’ll use during this lab to address these questions, in this section we’ll learn to:
Load Packages for tidy text mining and using Twitter APIs
Create a Twitter App to obtain authentication credentials, also known as keys and tokens
Authorize RStudio to use your app for retrieving data from Twitter
Instructions for accessing practice file and completing lab…
Let’s begin by loading some familiar packages from previous Learning Labs that we’ll be using for data wrangling and exploration:
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(readxl)
library(writexl)
library(DT)
The rtweet package provides users a range of functions designed to extract data from Twitter’s REST and streaming APIs and has three main goals:
Formulate and send requests to Twitter’s REST and stream APIs.
Retrieve and iterate over returned data.
Wrangle data into tidy structures.
Let’s load the rtweet package which we’ll be using later in this lab to accomplish all three of the goals listed above:
library(rtweet)
Before you can begin pulling tweets into R, you’ll first need to create a Twitter App in your developer account. This section and the one that follows are borrowed largely from the rtweet package by Michael Kearney and require that you have set up a Twitter developer account.
You are not required to set up a developer account for this institute, but if you are still interested in creating one, these instructions succinctly outline the process and you can set one up in about 10 minutes. We have provided the data we’ll be using for the Wrangling and Explore parts of the lab on our GitHub repository, so you can skip to section 2b. Tidy Text if you are not interested in or unable to set up a Twitter developer account.
Navigate to developer.twitter.com/en/apps, click the blue button that says, Create a New App, and then complete the form with the following fields:
App Name: What your app will be called
Application Description: How your app will be described to its users
Website URLs: Website associated with app–I recommend using the URL to your Twitter profile
Callback URLs: IMPORTANT enter exactly the following: http://127.0.0.1:1410
Tell us how this app will be used: Be clear and honest
When you’ve completed the required form fields, click the blue Create button at the bottom
Read through and indicate whether you accept the developer terms
And you’re done!
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018).
rtweet package to search for and filter tweets and users of interest.
tidytext package to both “tidy” and tokenize our tweets in order to create our data frame for analysis.
dplyr package to remove words that don’t add much value to our analysis.
This section introduces the following functions from the rtweet package for reading Twitter data into R:
search_tweets() Pulls up to 18,000 tweets from the last 6-9 days matching provided search terms.
search_tweets2() Returns data from multiple search queries.
get_timelines() Returns up to 3,200 tweets of one or more specified Twitter users.
Since one of our goals for this Learning Lab and the next is a very simplistic replication of the studies by Wang and Fikis (2019), let’s begin by introducing the search_tweets() function to try reading into R 5,000 tweets containing the CommonCore hashtag and store as a new data frame called ccss_tweets.
Type or copy the following code into your R Markdown file or console and run:
ccss_tweets <- search_tweets(q = "#CommonCore", n=5000)
Note that the first argument q = that the search_tweets() function expects is the search term included in quotation marks and that n = specifies the maximum number of tweets to return.
View your new ccss_tweets data frame using the glimpse() function introduced previously to help answer the following questions:
Wang and Fikis (2019) collected the tweets containing the hashtags #CommonCore and #CCSS for 12 months from 2014 to 2015. Unfortunately, a basic Twitter developer account only lets us go back about a week, but retrieving tweets with the two hashtags identified by the authors is not an issue.
Let’s modify our query using the OR operator to also include “#ccss” so it will return tweets containing either #commoncore or #ccss, and assign the result to ccss_or_tweets:
ccss_or_tweets <- search_tweets(q = "#commoncore OR #ccss", n=5000)
Try including both search terms but excluding the OR operator to answer the following question:
Does excluding the OR operator return more tweets, the same number of tweets, or fewer tweets? Why?
What other arguments does the search_tweets() function contain? Try adding one and see what happens.
Hint: Use the ?search_tweets help function to learn more about the q argument and other arguments for composing search queries.
Although Wang and Fikis (2019) limited their query to the two hashtags used above, at some point you may be interested in a more complex query that includes additional search terms. Unfortunately, the OR operator only gets us so far. In order to pass multiple queries, we will need to use the c() function to combine our search terms into a single character vector and pass it to the search_tweets2() function.
Copy and paste the following code to store the results of our query in ccss_tweets:
ccss_tweets <- search_tweets2(c("#commoncore OR #ccss",
'"common core standards"',
'"common core state standards"'),
n=5000,
include_rts = FALSE)
Notice the unique syntax required for the query argument. For example, when “OR” is entered between search terms, query = "#CommonCore OR #CCSS", Twitter’s REST API should return any tweet that contains either “#CommonCore” or “#CCSS.” It is also possible to search for exact phrases using double quotes. To do this, either wrap single quotes around a search query using double quotes, e.g., q = '"common core standards"' as we did above, or escape each internal double quote with a single backslash, e.g., q = "\"common core standards\"".
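As a quick check of the two quoting styles just described, the base R sketch below confirms that both produce exactly the same query string. The object names q_single and q_escaped are illustrative and are not part of the lab.

```r
# Two ways to write an exact-phrase query; object names are hypothetical
q_single  <- '"common core standards"'    # single quotes wrapped around double quotes
q_escaped <- "\"common core standards\""  # internal double quotes escaped with backslashes

# Both strings contain the same characters, including the literal double quotes
identical(q_single, q_escaped)

# cat() prints the string without R's own escaping, so the quotes are visible
cat(q_single, "\n")
```

Either form can then be supplied as the q argument; which one you choose is purely a matter of readability.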
To learn more about constructing search terms using the query argument, enter ?search_tweets in your console and review the documentation for the q= argument.
Use the search_tweets function to create your own custom query for a Twitter hashtag or topic(s) of interest.
As you may have noticed, we have way more data than we need for our analysis and should probably pare it down to just what we’ll use.
First, it’s likely the authors removed retweets from their analysis since a retweet is simply a user reposting someone else’s tweet and would duplicate the exact same content of the original. It’s also likely that they limited their analysis to just English-language tweets.
Let’s use the filter() function introduced in previous labs to subset rows containing only original tweets in the English language:
# is_retweet is a logical column, so test it directly rather than against the string "False"
ccss_tweets <- ccss_tweets %>%
  filter(!is_retweet,
         lang == "en")
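To see how filter() treats a logical column, here is a minimal sketch on a made-up data frame. The toy_tweets object and its values are invented for illustration, and the sketch assumes that is_retweet is stored as a logical TRUE/FALSE column rather than as text.

```r
library(dplyr)

# Made-up stand-in for search_tweets() output (values are illustrative only)
toy_tweets <- tibble(
  text       = c("Original post", "RT someone else", "Hola mundo"),
  is_retweet = c(FALSE, TRUE, FALSE),
  lang       = c("en", "en", "es")
)

# Keep rows that are not retweets and are in English
originals_en <- toy_tweets %>%
  filter(!is_retweet, lang == "en")

originals_en

# Comparing a logical value to the string "False" never matches,
# because R converts FALSE to the string "FALSE" before comparing
FALSE == "False"
```

If your own data stores this flag as text instead, inspect the column with glimpse() first and adjust the comparison accordingly.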
Now let’s use the select() function to select the following columns from our ccss_tweets data frame:
screen_name of the user who created the tweet
created_at timestamp for examining changes in sentiment over time
text containing the tweet which is our primary data source of interest
ccss_tweets <- select(ccss_tweets,
                      screen_name,
                      created_at,
                      text)
For the remainder of the lab, you’ll be asked to use your own Twitter data. Complete the following steps before proceeding to the 2b. Tidy Text section:
Create a new code chunk and write a query based on a STEM area of interest.
Subset your data to remove any unnecessary tweets from analysis.
Assign your search to a new object called my_tweets.
Output your new dataset using the datatable() function from the DT package and take a quick look.
Extra credit for using the %>% pipe operator and efficient use of arguments to keep your code succinct and using the <- assignment operator only once.
Your output should look something like this:
Finally, let’s save our tweet files to use in later exercises since tweets have a tendency to change every minute. We’ll save as a Microsoft Excel file since one of our columns cannot be stored in a flat file like .csv.
Let’s use the write_xlsx() function from the writexl package just like we would the write_csv() function from readr in Unit 1:
write_xlsx(ccss_tweets, "data/ccss_tweets.xlsx")
For your own research, you may be interested in exploring posts by specific users rather than topics, key words, or hashtags. Yes, there is a function for that too!
For example, let’s create another list containing the usernames of the LASER Institute leads using the c() function again and use the get_timelines() function to get the most recent tweets from each of those users:
fi <- c("sbkellogg", "jrosenberg6432", "yanecnu", "robmoore3", "hollylynnester")
fi_tweets <- fi %>%
get_timelines(include_rts=FALSE)
Notice that you can use the pipe operator with the rtweet functions just like you would other functions from the tidyverse.
And let’s use the sample_n() function from the dplyr package to pick 10 random tweets and use select() to view just the screen_name and text columns, which contain the user and the content of their post:
sample_n(fi_tweets, 10) %>%
select(screen_name, text)
## # A tibble: 10 x 2
## screen_name text
## <chr> <chr>
## 1 sbkellogg "@TooSweetGeek Any chance this might be useful? https://t.co/…
## 2 jrosenberg6432 "@lourocconi would love to discuss further! thanks for taking…
## 3 jrosenberg6432 "@racheljcox87 Woah! Me too!"
## 4 robmoore3 "@steph_moore A case study in how not to handle customer issu…
## 5 hollylynnester "Do you #teachstats? Check out this FREE #MOOC-Ed from the @F…
## 6 sbkellogg "Finally, this quote by @AlexDreier perfectly captures not ju…
## 7 jrosenberg6432 "@robmoore3 I think it was the right thing to do across multi…
## 8 robmoore3 "@AmyAndes 🤣🤣🤣"
## 9 jrosenberg6432 "My way of dealing with a tough week - blogging! On backpacki…
## 10 jrosenberg6432 "At a maximally meta level, what has gone wrong with #AERA21?…
The rtweet package also has a handy ts_plot() function to take a very quick look at how far back our data set goes:
ts_plot(ccss_tweets, by = "days")
Notice that this effectively creates a ggplot time series plot for us. I’ve included the by = argument which by default is set to “days.” It looks like tweets go back 9 days, which is as far back as a standard Twitter search reaches.
Try changing it to “hours” and see what happens.
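If you would like to see the numbers behind a plot like this, one rough way is to count tweets per day with dplyr. This is only a sketch on invented timestamps and is not a description of how ts_plot() works internally; toy_tweets, tweets_per_day, and the day column are hypothetical names.

```r
library(dplyr)

# Invented timestamps standing in for the created_at column
toy_tweets <- tibble(
  created_at = as.POSIXct(
    c("2019-01-15 21:05:13", "2019-01-15 23:42:39", "2019-01-16 03:08:27"),
    tz = "UTC"
  )
)

# Collapse each timestamp to its calendar day, then count tweets per day
tweets_per_day <- toy_tweets %>%
  count(day = as.Date(created_at, tz = "UTC"))

tweets_per_day
```

A table like this makes it easy to confirm the earliest and latest dates in your own data before relying on the plot.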
To conclude Section 2a, try one of the following search functions from the rtweet vignette:
get_timelines() Get the most recent 3,200 tweets from users.
stream_tweets() Randomly sample (approximately 1%) from the live stream of all tweets.
get_friends() Retrieve a list of all the accounts a user follows.
get_followers() Retrieve a list of the accounts following a user.
get_favorites() Get the most recently favorited statuses by a user.
get_trends() Discover what’s currently trending in a city.
search_users() Search for 1,000 users with the specific hashtag in their profile bios.
We’ve only scratched the surface of the number of functions available in the rtweet package for searching Twitter. To learn more about the rtweet package, you can find full documentation on CRAN at: <https://cran.r-project.org/web/packages/rtweet/rtweet.pdf>
Or use the following function to access the package vignette:
vignette("intro", package="rtweet")
Text data, by its very nature, is ESPECIALLY untidy and sometimes referred to as “unstructured.” The tidytext package provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages.
As we’ll learn firsthand later in this lab, using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in tidyverse packages to which we’ve already been introduced.
For a more comprehensive introduction to the tidytext package, we cannot recommend enough the free online book, Text Mining with R: A Tidy Approach by Silge and Robinson (2018).
Let’s go ahead and load tidytext:
library(tidytext)
Attention: From this point forward, we’ll also use a shared dataset constructed with the Twitter Academic Research product track that allows for a much greater number of tweets to be accessed over a far greater period of time. This will also ensure that we’re producing similar results so we can check to see if our code is behaving as expected.
Let’s use the readr package loaded in Section 1 and the read_csv() function to read in the data stored in the data folder of our R project:
ccss_tweets <- read_csv("data/ccss-tweets.csv",
col_types = cols(id = col_character(),
author_id = col_character()
)
)
In Chapter 1 of Text Mining with R, Silge and Robinson (2018) define the tidy text format as a table with one-token-per-row. A token is a meaningful unit of text, such as a word, two-word phrase (bigram), or sentence that we are interested in using for analysis. And tokenization is the process of splitting text into tokens.
This one-token-per-row structure is in contrast to the ways text is often stored for text analysis, perhaps as strings in a corpus object or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.
For this part of our workflow, our goal is to transform our ccss_tweets data from this:
## # A tibble: 11,592 x 4
## text id author_id created_at
## <chr> <chr> <chr> <dttm>
## 1 "Textual Evidence in #GatheringB… 108537318… 93237206838… 2019-01-16 03:08:27
## 2 "11 PM ET Thursday live 805-285-… 108533640… 38784416 2019-01-16 00:42:18
## 3 "11 PM ET Thursday live 805-285-… 108533633… 85409136 2019-01-16 00:42:00
## 4 "Petrilli /Forham were the ones … 108532985… 253343991 2019-01-16 00:16:16
## 5 "This is rich.\nLet's look at th… 108532928… 253343991 2019-01-16 00:14:00
## 6 "This new math will always leave… 108532139… 25189615 2019-01-15 23:42:39
## 7 "Our free #lesson gives a sample… 108528177… 91497123610… 2019-01-15 21:05:13
## 8 "With a focus on the complete wr… 108528172… 281718510 2019-01-15 21:05:00
## 9 "Super excited to learn from @ju… 108526899… 395853723 2019-01-15 20:14:25
## 10 "Find ideas for how to teach &am… 108526204… 99865113846… 2019-01-15 19:46:48
## # … with 11,582 more rows
Into a “tidy text” format that looks more like this familiar tibble data structure:
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 169,711 x 4
## id author_id created_at word
## <chr> <chr> <dttm> <chr>
## 1 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 textual
## 2 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 evidence
## 3 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #gatheringblue
## 4 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #ccss
## 5 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl71
## 6 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl72
## 7 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #w72
## 8 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #middleschool
## 9 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #interactivenotebo…
## 10 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 🍎📚📝✅💯❤️🚀
## # … with 169,701 more rows
Later we’ll learn about other data structures for text analysis like the document-term matrix and corpus objects. For now, however, working with the familiar tidy data frame allows us to take advantage of popular packages that use the shared tidyverse syntax and principles for wrangling, exploring, and modeling data.
The tidytext package provides the incredibly powerful unnest_tokens() function to tokenize text (including tweets!) and convert them to a one-token-per-row format.
Let’s tokenize our tweets by using this function to split each tweet into one word per row to make it easier to analyze:
ccss_unigrams <- unnest_tokens(ccss_tweets,
output = word,
input = text)
There is A LOT to unpack with this function. First notice that unnest_tokens() expects a data frame as the first argument, followed by two column names. The second argument is an output column name that doesn’t currently exist but will be created as the text is “unnested” into it (word, in this case). This is followed by the input column that the text comes from, which we uncreatively named text. Also notice:
By default, a token is an individual word or unigram.
Other columns, such as author_id and created_at, are retained.
All punctuation has been removed.
Tokens have been changed to lowercase, which makes them easier to compare or combine with other datasets (use the to_lower = FALSE argument to turn off if desired).
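The behaviors listed above are easy to verify on a single made-up tweet. The sketch below uses an invented toy_tweets data frame, not the lab data, and relies only on the default word tokenizer.

```r
library(dplyr)
library(tidytext)

# A single made-up tweet (not from the lab data)
toy_tweets <- tibble(
  id   = "1",
  text = "Common Core math is HARD!"
)

# One row per word; the id column is carried along with each token
toy_unigrams <- toy_tweets %>%
  unnest_tokens(output = word, input = text)

toy_unigrams
```

The result has five rows, one per word, with the exclamation point dropped, every token in lowercase, and the id value repeated on each row.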
The unnest_tokens() function also has a specialized “tweets” tokenizer, set through the token = argument, that is very useful for dealing with Twitter text in that it retains hashtags and mentions of usernames with the @ symbol.
Rewrite the code below to include the token argument set to “tweets”:
ccss_unigrams <- unnest_tokens(ccss_tweets,
output = word,
input = text,
_____ = _____)
Your output should look something like this:
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 289,390 x 4
## id author_id created_at word
## <chr> <chr> <dttm> <chr>
## 1 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 textual
## 2 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 evidence
## 3 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 in
## 4 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #gatheringblue
## 5 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #ccss
## 6 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl71
## 7 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl72
## 8 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #w72
## 9 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #middleschool
## 10 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #interactivenotebo…
## # … with 289,380 more rows
In the function above we specified tokens as individual words, but many interesting text analyses are based on the relationships between words, which words tend to follow others immediately, or words that tend to co-occur within the same documents.
We can also use the unnest_tokens() function to tokenize our tweets into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them, as we’ll see in Section 4. MODEL.
We do this by adding the token = "ngrams" option to unnest_tokens(), and setting n to the number of words in each n-gram. Let’s set n to 2, so we can examine pairs of two consecutive words, often called “bigrams”:
ccss_bigrams <- ccss_tweets %>%
unnest_tokens(bigram,
text,
token = "ngrams",
n = 2)
ccss_bigrams
## # A tibble: 301,118 x 4
## id author_id created_at bigram
## <chr> <chr> <dttm> <chr>
## 1 10853731852631… 932372068380880… 2019-01-16 03:08:27 textual evidence
## 2 10853731852631… 932372068380880… 2019-01-16 03:08:27 evidence in
## 3 10853731852631… 932372068380880… 2019-01-16 03:08:27 in gatheringblue
## 4 10853731852631… 932372068380880… 2019-01-16 03:08:27 gatheringblue ccss
## 5 10853731852631… 932372068380880… 2019-01-16 03:08:27 ccss rl71
## 6 10853731852631… 932372068380880… 2019-01-16 03:08:27 rl71 rl72
## 7 10853731852631… 932372068380880… 2019-01-16 03:08:27 rl72 w72
## 8 10853731852631… 932372068380880… 2019-01-16 03:08:27 w72 middleschool
## 9 10853731852631… 932372068380880… 2019-01-16 03:08:27 middleschool interactiv…
## 10 10853731852631… 932372068380880… 2019-01-16 03:08:27 interactivenotebook 2ks…
## # … with 301,108 more rows
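To make the overlapping structure of bigrams concrete, here is a small sketch on one invented sentence. The toy_tweets and toy_bigrams names are hypothetical.

```r
library(dplyr)
library(tidytext)

# One made-up tweet with four words
toy_tweets <- tibble(id = "1", text = "Common Core math standards")

# Slide a two-word window across the text, one bigram per row
toy_bigrams <- toy_tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
```

Four words yield three bigrams, and every word except the first and last appears in two of them. This overlap is why the bigram table above has roughly as many rows as the unigram table.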
Before we move any further let’s take a quick look at the most common unigrams and bigrams in our two datasets:
ccss_unigrams %>%
count(word, sort = TRUE)
## # A tibble: 39,703 x 2
## word n
## <chr> <int>
## 1 #commoncore 9852
## 2 the 8508
## 3 to 6858
## 4 of 4821
## 5 and 4129
## 6 a 3686
## 7 is 3567
## 8 in 3224
## 9 for 3015
## 10 this 2415
## # … with 39,693 more rows
ccss_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 138,876 x 2
## bigram n
## <chr> <int>
## 1 https t.co 10538
## 2 commoncore https 940
## 3 common core 935
## 4 commoncore math 749
## 5 of the 735
## 6 in the 504
## 7 the commoncore 498
## 8 commoncore commonsense 497
## 9 of writing 492
## 10 say https 491
## # … with 138,866 more rows
Well, many of these tweets are clearly about the common core, but beyond that it’s a bit hard to tell because there are so many “stop words” like “the,” “to,” “and,” “in” that don’t carry much meaning by themselves.
Often in text analysis, we will want to remove these stop words if they are not useful for an analysis. The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.
Let’s take a closer look at the lexicons and stop words included in each:
datatable(stop_words)
anti_join Function
In order to remove these stop words, we will use a function called anti_join() that looks for matching values in a specific column from two datasets and returns rows from the original dataset that have no matches.
For a good overview of the different dplyr joins see here: https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119
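A tiny example may help show what anti_join() keeps and drops. Both data frames below are invented for illustration; toy_stops stands in for the much larger stop_words dataset.

```r
library(dplyr)

# Made-up tokens and a made-up stop word list
toy_words <- tibble(word = c("the", "math", "is", "hard"))
toy_stops <- tibble(word = c("the", "is", "of"))

# Keep only rows of toy_words whose word has no match in toy_stops
kept <- anti_join(toy_words, toy_stops, by = "word")

kept
```

Only “math” and “hard” remain. The unmatched stop word “of” has no effect because anti_join() only ever returns rows from the first data frame.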
Now let’s remove stop words that don’t help us learn much about what people are saying about the state standards.
tidy_unigrams <- anti_join(ccss_unigrams,
stop_words,
by = "word")
tidy_unigrams
## # A tibble: 169,711 x 4
## id author_id created_at word
## <chr> <chr> <dttm> <chr>
## 1 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 textual
## 2 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 evidence
## 3 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #gatheringblue
## 4 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #ccss
## 5 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl71
## 6 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #rl72
## 7 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #w72
## 8 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #middleschool
## 9 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 #interactivenotebo…
## 10 1085373185263128576 9323720683808808… 2019-01-16 03:08:27 🍎📚📝✅💯❤️🚀
## # … with 169,701 more rows
Notice that we’ve specified the by = argument to look for matching words in the word column for both data sets and remove any rows from the ccss_unigrams dataset that match the stop_words dataset. Remember when we first tokenized our dataset I conveniently chose output = word as the column name because it matches the column name word in the stop_words dataset contained in the tidytext package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset. However, specifying by = wasn’t strictly necessary since word is the only matching column name in both datasets and anti_join() would have matched on it by default.
As we saw above, a lot of the most common bigrams are pairs of common (uninteresting) words as well. Dealing with these is a little less straightforward and we’ll need to use the separate() function from the tidyr package, which splits a column into multiple columns based on a delimiter. This lets us separate each bigram into two columns, “word1” and “word2,” at which point we can remove cases where either is a stop word.
library(tidyr)
bigrams_separated <- ccss_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
tidy_bigrams <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
Before wrapping up, let’s take a quick count of the most common unigrams and bigrams to see if the results are a little more meaningful:
tidy_unigrams %>%
count(word, sort = TRUE)
## # A tibble: 39,086 x 2
## word n
## <chr> <int>
## 1 #commoncore 9852
## 2 #ccss 1843
## 3 math 1750
## 4 amp 1520
## 5 #education 1064
## 6 students 1053
## 7 common 1011
## 8 core 987
## 9 education 850
## 10 standards 846
## # … with 39,076 more rows
tidy_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 63,513 x 2
## bigram n
## <chr> <int>
## 1 https t.co 10538
## 2 commoncore https 940
## 3 common core 935
## 4 commoncore math 749
## 5 commoncore commonsense 497
## 6 commonsense knowledge 488
## 7 compact style 488
## 8 power site:is 488
## 9 writing common 488
## 10 t.co sxw9rhgwzm 470
## # … with 63,503 more rows
Notice that the nonsense word “amp” is among our high frequency words. Let’s add a filter to our previous code similar to what we did with our bigrams to remove rows with “amp” in them:
tidy_unigrams <-
ccss_unigrams %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "amp")
Note that we could extend this filter to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.
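One way to extend the filter is to keep the unwanted words in a vector and test membership with %in%. This is a sketch on invented tokens; custom_stops is a hypothetical list that you would adapt to whatever noise shows up in your own counts.

```r
library(dplyr)

# Made-up tokens standing in for tidy_unigrams
toy_unigrams <- tibble(word = c("math", "amp", "https", "standards", "t.co"))

# Hypothetical list of extra tokens to drop; adjust for your own data
custom_stops <- c("amp", "https", "t.co")

# Remove every row whose word appears in the custom list
cleaned <- toy_unigrams %>%
  filter(!word %in% custom_stops)

cleaned
```

Keeping the list in one named object means it can be reused for both unigrams and the separated bigram columns.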
Tidy your my_tweets dataset from the ✅ Comprehension Check in Section 2a by tokenizing your text into unigrams and removing stop words.
Also, since we created some unnecessarily lengthy code to demonstrate some of the steps in the tidying process, try to use a more compact series of functions and assign your data frame to my_tidy_tweets.
As highlighted in DSEIUR and Learning Analytics Goes to School, calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are a key part of exploratory data analysis. In Section 3, we will calculate some very basic summary statistics from our tidied text, explore key words of interest to gather additional context, and use data visualization to identify patterns and trends that may not be obvious from our tables and numerical summaries. Topics addressed in Section 3 include:
Time Series. We take a quick look at the date range of our tweets and compare number of postings by standards.
People new to text mining are often disillusioned when they figure out how it’s actually done — which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language — but think “surely it’s something better than that?”
The quote above from Word Counts are Amazing by Ted Underwood…
Recall from the previous section that one overarching question guiding most of our efforts in these text mining labs is: “How do we quantify what a text is about?”
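library(wordcloud2)
# Word cloud of unigrams that appear more than 200 times
tidy_unigrams %>%
  count(word) %>%
  filter(n > 200) %>%
  wordcloud2()
# Bar chart of unigrams that appear more than 500 times, ordered by frequency
tidy_unigrams %>%
  count(word) %>%
  filter(n > 500) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col()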
ccss_tweets %>%
select(text) %>%
filter(grepl('math', text)) %>%
sample_n(20)
## # A tibble: 20 x 1
## text
## <chr>
## 1 "#MathTeachers check out #STEMvideohall & learn how these federally fund…
## 2 "FRACTIONS NUMBER TALK\n\nWhat fraction of each square is:\n A. B. \n🟩…
## 3 "Any of my common core teachers out there need a nautical themed math practi…
## 4 "Volume Task Cards by Making it through Middle School | TpT https://t.co/Xw5…
## 5 "After the pandemic ends, the one and only thing that can unite us is a figh…
## 6 "Is this #CommonCore math? The US has ~3.1 million confirmed cases. Curren…
## 7 "How many newspaper reporters does it take to do fifth grade math? Thanks to…
## 8 "#nextgenmath offers a real time data driven blended learning function that …
## 9 "Alabama's #math and #reading scores are some of the worst in the country. S…
## 10 "Ever want to show students how to add and subtract using the number one up …
## 11 "Hey #mathteachers , an #englishteacher needs your help! What #CCSS #highsc…
## 12 "Nice thread. #CCSS #mathchat #math #STEMeducation https://t.co/Ts2PHBcKPz"
## 13 "@PubliusDB Common Core math applied! Until Legislators at Fed and State l…
## 14 "Obama's #CommonCore math https://t.co/gQ6tStLftz"
## 15 "Through a cross-district community of practice, Math in Common brought toge…
## 16 "@Michael4Tune @KurtSchlichter @vjeannek Especially since the ONLY math ther…
## 17 "@ECityMom @JohnFashionGuy @davideggenAB @DShepYEG IF you would have looked …
## 18 "@Nate27301510 @NicoleArbour #CommonCore math.. The definitive decline of …
## 19 "I love how every parent in #florida is acting like math won’t still be conf…
## 20 "STRUGGLE FREE DIVISION AND MULTIPLICATION OF FRACTIONS by Number Sense Guy …
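# Keep words used more than 200 times
unigram_counts <- tidy_unigrams %>%
  count(word) %>%
  filter(n > 200)
# Highlight words used more than 800 times in black and all others in gray
wordcloud2(unigram_counts,
           color = ifelse(unigram_counts$n > 800, 'black', 'gray'))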
tidy_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 63,513 x 2
## bigram n
## <chr> <int>
## 1 https t.co 10538
## 2 commoncore https 940
## 3 common core 935
## 4 commoncore math 749
## 5 commoncore commonsense 497
## 6 commonsense knowledge 488
## 7 compact style 488
## 8 power site:is 488
## 9 writing common 488
## 10 t.co sxw9rhgwzm 470
## # … with 63,503 more rows
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
library(stringr)
math_bigrams <- ccss_tweets %>%
filter(str_detect(text, 'math')) %>%
unnest_tokens(bigram,
text,
token = "ngrams",
n = 2)
bigrams_separated <- math_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
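# Count word pairs, keep those seen more than 10 times, and convert to a graph object
bigram_graph <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n > 10) %>%
  graph_from_data_frame()
# Fix the random seed so the network layout is reproducible
set.seed(2017)
# Plot the network: each edge is a bigram, with darker edges for more frequent pairs
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)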
library(widyr)
math_unigrams <- ccss_tweets %>%
filter(str_detect(text, 'math')) %>%
unnest_tokens(word,
text,
token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
word_pairs <- math_unigrams %>%
pairwise_count(word, id, sort = TRUE)
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
word_cors <- math_unigrams %>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, id, sort = TRUE)
word_cors %>%
filter(item1 == "math")
## # A tibble: 331 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 math i 0.141
## 2 math they 0.116
## 3 math core 0.109
## 4 math is 0.104
## 5 math common 0.103
## 6 math must 0.0933
## 7 math thats 0.0930
## 8 math think 0.0918
## 9 math do 0.0914
## 10 math me 0.0897
## # … with 321 more rows
word_cors <- tidy_unigrams %>%
group_by(word) %>%
filter(n() >= 50) %>%
pairwise_cor(word, id, sort = TRUE)
word_cors %>%
filter(item1 %in% c("math", "#math")) %>%
group_by(item1) %>%
slice_max(correlation, n = 6) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()
word_cors %>%
filter(correlation > .15) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
## Warning: ggrepel: 186 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps